Strip a leading byte order mark before parsing by hsbt · Pull Request #801 · ruby/psych

hsbt · 2026-06-12T01:25:47Z

Fixes #331.

YAML.load("\uFEFFa: b\nc: d") returns {"a" => "b"}, silently dropping everything after the first newline, and Psych.parse_stream raises Psych::SyntaxError on the same input. A BOM at the start of the stream is legal per YAML 1.2 §5.2.
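The report can be reproduced with a short script. This is only a sketch: it assumes a libyaml-backed Psych, and the `attempt` helper is a hypothetical convenience that keeps the script running when `Psych::SyntaxError` is raised. No output is asserted, because the result depends on the installed Psych version.

```ruby
require "psych"

# A UTF-8 string that starts with U+FEFF (bytes EF BB BF).
yaml = "\uFEFFa: b\nc: d"

# Hypothetical helper: return the block's value, or the syntax error it raised.
def attempt
  yield
rescue Psych::SyntaxError => e
  e
end

loaded = attempt { Psych.load(yaml) }
parsed = attempt { Psych.parse_stream(yaml) }

# Per the description above, affected versions give {"a" => "b"} for `loaded`
# and a Psych::SyntaxError for `parsed`; fixed versions should see both keys.
p loaded
p parsed.class
```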

Psych tells libyaml the input encoding whenever it is known, so libyaml's reader-level BOM stripping, which only runs during encoding auto-detection, never happens. The scanner skips the BOM instead but counts it as a first-line column, so every token on the first line shifts one column right and a root-level block mapping terminates at the second line. Reported upstream as yaml/libyaml#334.
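The column shift can be illustrated with a toy model. This is not libyaml's scanner; it only mimics the indentation rule the paragraph above relies on, namely that a block mapping continues while later keys sit at the same column as its first key. The names `key_columns` and `keys_in_root_mapping` are hypothetical.

```ruby
# Toy model only: not libyaml's scanner, just the indentation rule.
BOM = "\uFEFF"

# Column of the first key on each line, optionally counting a BOM as a column.
def key_columns(source, count_bom:)
  source.each_line.map do |line|
    text = count_bom ? line : line.delete_prefix(BOM)
    text.index(/[^\s\uFEFF]/)
  end
end

# Number of leading lines whose key sits at the same column as the first key.
def keys_in_root_mapping(source, count_bom:)
  columns = key_columns(source, count_bom: count_bom)
  columns.take_while { |col| col == columns.first }.size
end

source = "#{BOM}a: b\nc: d"
puts keys_in_root_mapping(source, count_bom: true)   # 1: "a" is at column 1, "c" at column 0
puts keys_in_root_mapping(source, count_bom: false)  # 2: both keys are at column 0
```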

This strips the BOM at the Ruby level before the input reaches libyaml: the first character of UTF-8/UTF-16 strings, and of seekable IOs whose external encoding is one of those. Binary strings and IOs are left untouched since they go through libyaml's auto-detection, which already handles the BOM correctly. JRuby is unaffected (snakeyaml-engine excludes the BOM from column counting) and the new tests pass on both implementations.
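For the string case, the approach described above might look roughly like the following. It is a minimal sketch, not the code merged in this PR: `BOM_BY_ENCODING` and `strip_leading_bom` are hypothetical names, and the table covers only the encodings named in the description.

```ruby
# Hypothetical lookup table: the BOM character expressed in each handled encoding.
BOM_BY_ENCODING = [
  Encoding::UTF_8, Encoding::UTF_16LE, Encoding::UTF_16BE
].to_h { |enc| [enc, "\uFEFF".encode(enc)] }.freeze

# Hypothetical helper: drop one leading BOM character from a String whose
# encoding is in the table; anything else is returned unchanged.
def strip_leading_bom(yaml)
  return yaml unless yaml.is_a?(String)

  bom = BOM_BY_ENCODING[yaml.encoding]
  bom && yaml.start_with?(bom) ? yaml[1..] : yaml
end

p strip_leading_bom("\uFEFFa: b\nc: d")
```

Binary strings have no entry in the table, so they pass through untouched and are left to libyaml's own encoding detection, as the description states.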

🤖 Generated with Claude Code

libyaml only discounts the BOM when it detects the stream encoding by itself. Psych passes the encoding explicitly whenever it is known, and on that path libyaml counts the BOM as a first-line character, shifting every token on the first line one column right and silently terminating a block mapping at the second line. #331 Co-Authored-By: Claude Fable 5 <[email protected]>

Copilot

Pull request overview

This PR fixes YAML parsing when an input stream begins with a Unicode byte order mark (BOM), aligning behavior with YAML 1.2 by ensuring BOMs don’t shift token columns and break multi-line root-level mappings.

Changes:

Strip a leading BOM in Psych::Parser#parse for UTF-8/UTF-16 inputs before handing data to libyaml.
Add regression tests for Psych.load, Psych.parse_stream, and Psych::Parser#parse covering multi-line mappings with BOM (UTF-8, UTF-16, and IO cases).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
lib/psych/parser.rb	Adds Ruby-level BOM stripping before invoking the native libyaml parser.
test/psych/test_psych.rb	Adds high-level regression tests for `load` and `parse_stream` with a leading BOM.
test/psych/test_parser.rb	Adds parser-level BOM regression tests (UTF-8/UTF-16/IO) and a helper for scalar extraction.


+      if String === yaml
+        bom = BOM[yaml.encoding]
+        return yaml[1..-1] if bom && yaml.start_with?(bom)
+      elsif yaml.respond_to?(:read) && yaml.respond_to?(:external_encoding) &&


If pos succeeded but the later seek failed, the rescue silently discarded the bytes read to check for a BOM. Only the initial pos call is expected to fail, for non-seekable IOs, before anything is consumed. Co-Authored-By: Claude Fable 5 <[email protected]>
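The rescue scoping described in this commit message can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `skip_leading_bom` is a hypothetical name, it handles UTF-8 only, and it uses `getc` where the actual implementation may read differently.

```ruby
require "stringio"

# Hypothetical helper: leave `io` positioned just past a leading BOM, if any.
def skip_leading_bom(io)
  begin
    start = io.pos            # raises on non-seekable IOs such as pipes
  rescue SystemCallError
    return io                 # nothing has been consumed yet, so this is safe
  end

  first = io.getc             # consumes one character
  # Deliberately outside the rescue: if rewinding fails, the error surfaces
  # instead of silently losing the character that was just read.
  io.seek(start) unless first == "\uFEFF"
  io
end

io = StringIO.new("\uFEFFa: b\nc: d")
skip_leading_bom(io)
p io.read
```

Keeping only the `pos` call inside the `begin`/`rescue` is the design point: a failure there means no bytes were read, while a later failure would otherwise discard input.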


Copilot AI review requested due to automatic review settings June 12, 2026 01:25

Copilot started reviewing on behalf of hsbt June 12, 2026 01:26

Copilot AI reviewed Jun 12, 2026


Comment thread lib/psych/parser.rb

Comment on lines +80 to +83

      if String === yaml
        bom = BOM[yaml.encoding]
        return yaml[1..-1] if bom && yaml.start_with?(bom)
      elsif yaml.respond_to?(:read) && yaml.respond_to?(:external_encoding) &&

hsbt and others added 3 commits June 12, 2026 11:01

Strip the BOM from UTF-32 strings too

72eee65

The C extension transcodes UTF-32 strings to UTF-8 with the BOM preserved, so they were truncated the same way as UTF-8 input. Co-Authored-By: Claude Fable 5 <[email protected]>
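The effect can be approximated in plain Ruby. This sketch uses `String#encode` as a stand-in for the C extension's transcoding, which is an assumption; it only shows that a BOM in an endian-specific UTF-32 string survives conversion to UTF-8 unless it is removed first.

```ruby
utf32 = "\uFEFFa: b\nc: d".encode(Encoding::UTF_32LE)

# Transcoding keeps U+FEFF as the first character of the UTF-8 result.
utf8 = utf32.encode(Encoding::UTF_8)
puts utf8.start_with?("\uFEFF")   # true

# Removing the BOM while the string is still UTF-32 avoids the problem.
bom32 = "\uFEFF".encode(Encoding::UTF_32LE)
clean = utf32.delete_prefix(bom32).encode(Encoding::UTF_8)
p clean
```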

Use String#delete_prefix to strip the BOM

681fb06

Co-Authored-By: Claude Fable 5 <[email protected]>
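As a rough illustration of why `String#delete_prefix` fits here (the diff for this commit is not shown, so the snippet below is an assumption about its shape), it folds the `start_with?` check and the slice into one call:

```ruby
bom = "\uFEFF"
with_bom = "\uFEFFa: b"
without_bom = "a: b"

# Earlier form: test for the prefix, then slice off one character.
sliced = with_bom.start_with?(bom) ? with_bom[1..-1] : with_bom

# delete_prefix does both steps and leaves BOM-less input unchanged.
p with_bom.delete_prefix(bom) == sliced
p without_bom.delete_prefix(bom) == without_bom
```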

hsbt merged commit a1dcb86 into master Jun 12, 2026
164 checks passed

hsbt deleted the claude/priceless-cannon-38818f branch June 12, 2026 02:53

hsbt merged 4 commits into master from claude/priceless-cannon-38818f